====================================================================================================================

PART ONE

====================================================================================================================

• DOMAIN: Semiconductor manufacturing process

• CONTEXT: A complex modern semiconductor manufacturing process is normally under constant surveillance via the monitoring of signals/variables collected from sensors and/or process measurement points. However, not all of these signals are equally valuable in a specific monitoring system. The measured signals contain a combination of useful information, irrelevant information, and noise. Engineers typically have far more signals than are actually required. If we consider each type of signal as a feature, then feature selection can be applied to identify the most relevant signals. Process engineers can then use these signals to determine the key factors contributing to yield excursions downstream in the process, enabling increased process throughput, decreased time to learning, and reduced per-unit production costs. These signals can be used as features to predict the yield type, and by analysing different combinations of features, the essential signals impacting the yield type can be identified.

• DATA DESCRIPTION: sensor-data.csv : (1567, 592)

The data consists of 1567 examples each with 591 features.

The dataset presented in this case represents a selection of such features where each example represents a single production entity with associated measured features and the labels represent a simple pass/fail yield for in house line testing.

In the target column, “-1” corresponds to a pass and “1” corresponds to a fail, and the time stamp is for that specific test point.

• PROJECT OBJECTIVE: We will build a classifier to predict the Pass/Fail yield of a particular process entity and analyse whether all the features are required to build the model or not.


1. Import and explore the data.

Here we have 1567 rows and 592 columns


2. Data cleansing:

• Missing value treatment.

• Drop attribute/s if required using relevant functional knowledge.

• Make all relevant modifications on the data using both functional/logical reasoning/assumptions

• Missing value treatment.

We have 41,951 missing values in total, which is substantial.

The maximum number of nulls in a single column is 1,429.

We have many columns with a large number of missing values.

Since the values correspond to test results, a missing value means the measurement was not available or not calculated. The absence of a signal is treated as no signal in this dataset, so it is better to replace missing values with zeros rather than with the column mean or median.

Let us first drop the columns with more than 25% missing values.

We have dropped the columns with a very high number of missing values, and replaced the remaining NaNs with 0.
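The missing-value treatment described above can be sketched with pandas; the 25% threshold and zero-filling follow the text, while the small DataFrame here is a synthetic stand-in for sensor-data.csv.

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for sensor-data.csv
df = pd.DataFrame({
    "s1": [1.0, np.nan, 3.0, 4.0],
    "s2": [np.nan, np.nan, np.nan, 2.0],   # 75% missing -> will be dropped
    "s3": [5.0, 6.0, np.nan, 8.0],
})

# Drop columns with more than 25% missing values
keep = df.columns[df.isna().mean() <= 0.25]
df = df[keep]

# Absence of a signal is treated as "no signal": fill remaining NaNs with 0
df = df.fillna(0)

print(list(df.columns))        # s2 is gone
print(df.isna().sum().sum())   # 0 missing values remain
```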

• Drop attribute/s if required using relevant functional knowledge.

• Make all relevant modifications on the data using both functional/logical reasoning/assumptions

We can see that some of the features are highly correlated with each other. Hence we can remove the redundant correlated features.
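One common way to drop correlated features is to scan the upper triangle of the absolute correlation matrix and remove one feature from each pair above a threshold (0.7, matching the cutoff stated later in this report). The data below is synthetic for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=200),  # almost perfectly correlated with a
    "c": rng.normal(size=200),
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.7).any()]
df_reduced = df.drop(columns=to_drop)
print(to_drop)   # ['b']
```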


3. Data analysis & visualisation:

• Perform detailed relevant statistical analysis on the data.

• Perform a detailed univariate, bivariate and multivariate analysis with appropriate detailed comments after each analysis.

Here we have very diverse and unscaled data.

We have a large number of examples labelled -1 and very few labelled 1.

There are 1567 rows in total.

We can see very diverse standard deviations across the features.

Let us analyse the pass/fail criterion.

Here we can see that the data is highly imbalanced.

In the target column, “-1” corresponds to a pass and “1” corresponds to a fail.

We have a large number of pass examples and very few fail examples.

As we can see, there is a large amount of skewness and there are many outliers in the data.

Many of the columns are not good at discriminating pass from fail:

the distributions of pass and fail overlap.

Since we have already removed the highly correlated features, the remaining features show lower correlation.
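The class-imbalance and skewness checks above reduce to a few pandas calls; the data here is a synthetic stand-in with roughly the same pass/fail ratio as described.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic stand-in: one right-skewed feature plus an imbalanced pass/fail target
df = pd.DataFrame({
    "feature": rng.exponential(scale=2.0, size=1000),       # right-skewed
    "target": np.where(rng.random(1000) < 0.93, -1, 1),     # ~93% pass (-1)
})

counts = df["target"].value_counts()
print(counts)                           # far more -1 (pass) than 1 (fail)
print(round(df["feature"].skew(), 2))   # clearly positive skew

imbalanced = counts.max() / counts.min() > 5
print(imbalanced)
```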


4. Data pre-processing:

• Segregate predictors vs target attributes

• Check for target balancing and fix it if found imbalanced.

• Perform train-test split and standardise the data or vice versa if required.

• Check if the train and test data have similar statistical characteristics when compared with original data.

• Check for target balancing and fix it if found imbalanced.

• Perform train-test split and standardise the data or vice versa if required.

We have included scaling in the pipeline, hence we will not separately scale the sampled data.
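Putting the scaler inside the pipeline means it is fit only on the training portion of each split, avoiding leakage. A minimal sketch with scikit-learn (synthetic data; the report's SMOTE step would slot in the same way via imblearn's Pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced stand-in for the sensor data
X, y = make_classification(n_samples=400, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Scaling lives inside the pipeline, so it is fit on training data only;
# with imblearn's Pipeline, a SMOTE step could be added before "classification".
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classification", SVC(kernel="rbf")),
])
pipe.fit(X_train, y_train)
print(round(pipe.score(X_test, y_test), 2))
```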

• Check if the train and test data have similar statistical characteristics when compared with original data.

Here we can see that the training, testing, and original data have similar statistical distributions.

The training, testing, and original data have similar proportions of pass/fail counts.

Hence we can say that the training and testing data have similar statistical characteristics when compared with the original data.


5. Model training, testing and tuning:

• Model training:

- Pick up a supervised learning model.
- Train the model.
- Use cross validation techniques.
        Hint: Use all CV techniques that you have learnt in the course.
- Apply hyper-parameter tuning techniques to get the best accuracy.
        Suggestion: Use all possible hyper parameter combinations to extract the best accuracies.
- Use any other technique/method which can enhance the model performance.
        Hint: Dimensionality reduction, attribute removal, standardisation/normalisation, target balancing etc.
- Display and explain the classification report in detail.
- Design a method of your own to check whether the achieved train and test accuracies might change if a different sample population is used for the train and test sets.
        Hint: You can use your concepts learnt under Applied Statistics module.
- Apply the above steps for all possible models that you have learnt so far.

• Display and compare all the models designed with their train and test accuracies.

• Select the final best trained model along with your detailed comments for selecting this model.

• Pickle the selected model for future use.

• Import the future data file. Use the same to perform the prediction using the best chosen model from above. Display the prediction results.

• Model training:

- Pick up a supervised learning model.

- Train the model.

- Use cross validation techniques.

- Apply hyper-parameter tuning techniques to get the best accuracy.

Logistic regression :

Here we can see that the accuracy is 88.32%, which is fair.

We can see that recall is higher than precision here. Precision quantifies the number of positive class predictions that actually belong to the positive class. Recall quantifies the number of positive class predictions made out of all positive examples in the dataset.

F1 score here is fair enough

Let us iterate over the other models and compare them.

Here we can see that SVC and LGB give the best results.

SVC and LGB have better accuracy, F1 score, precision, and recall compared to the other models:

SVC F1 score = 0.986048; LGB F1 score = 0.979561

Let us use these two models for further comparison and analysis.
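The model-comparison loop can be sketched as below: each candidate is wrapped in the same scaling pipeline and scored with cross-validated F1. Only two representative models and synthetic data are shown here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

models = {
    "Logistic": LogisticRegression(max_iter=1000),
    "SVC": SVC(kernel="rbf"),
}
scores = {}
for name, clf in models.items():
    pipe = Pipeline([("scaler", StandardScaler()), ("classification", clf)])
    # Mean F1 across 5 folds for each candidate model
    scores[name] = cross_val_score(pipe, X, y, cv=5, scoring="f1").mean()

# Print models ranked by mean F1
for name, f1 in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {f1:.3f}")
```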

We can see that with the parameters 'classification__kernel': 'rbf', 'classification__gamma': 0.01, 'classification__C': 1000, the F1 for SVC is 0.9988609281489563.

We can see that with the parameters 'classification__subsample_freq': 20, 'classification__subsample': 0.7, 'classification__reg_lambda': 1.3, 'classification__reg_alpha': 1.1, 'classification__num_leaves': 50, 'classification__n_estimators': 700, 'classification__min_split_gain': 0.4, 'classification__max_depth': 15, 'classification__colsample_bytree': 0.7, the F1 for the LGB classifier is 0.9703718701491529.

Hence we can infer from the above analysis that SVC gives the best results among the models we have taken into consideration.

Grid search CV also provides similar results, but random search CV has a shorter execution time.

Grid search CV takes far more execution time.
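The grid/random search trade-off noted above can be sketched as follows; the `classification__` prefixes route parameters to the pipeline's classifier step, matching the parameter names reported in the text (data and the candidate value lists are illustrative).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("classification", SVC(kernel="rbf"))])

# The "classification__" prefix routes each parameter to the SVC step.
# RandomizedSearchCV samples n_iter combinations instead of trying all of
# them, which is why it runs faster than an exhaustive GridSearchCV.
param_dist = {
    "classification__C": [1, 10, 100, 1000],
    "classification__gamma": [0.001, 0.01, 0.1],
}
search = RandomizedSearchCV(pipe, param_dist, n_iter=6, cv=3,
                            scoring="f1", random_state=0)
search.fit(X, y)
print(search.best_params_)
print(round(search.best_score_, 3))
```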


We have already tried out attribute removal, standardisation/normalisation, and target balancing. Let us now try dimensionality reduction.

After reducing the dimensionality from 296 to 25 features, we get an F1 score of 0.9931661451766062 for SVC, which is good enough.

We were able to reduce the dimensionality significantly here.
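A PCA step inside the same pipeline is one way to achieve this reduction to 25 components (the text does not name the exact method used, so PCA is an assumption; the data here is synthetic).

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=100,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Scale, project 100 features down to 25 components, then classify
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=25, random_state=0)),
    ("classification", SVC(kernel="rbf")),
])
pipe.fit(X_train, y_train)
print(pipe.named_steps["pca"].n_components_)          # 25
print(round(f1_score(y_test, pipe.predict(X_test)), 3))
```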


- Display and explain the classification report in detail.

After the dimensionality reduction, the results are:

With SVM, the accuracy on the train data is: 1.0

With SVM, the accuracy on the test data is: 0.9931662870159453

Classification report:

              precision    recall  f1-score   support

          -1       1.00      0.99      0.99       444
           1       0.99      1.00      0.99       434

    accuracy                           0.99       878
   macro avg       0.99      0.99      0.99       878
weighted avg       0.99      0.99      0.99       878

- Design a method of your own to check whether the achieved train and test accuracies might change if a different sample population is used for the train and test sets.

Feature Engineering

Here we can see that features like feature 59 are of the highest importance; the ones above are the most important among all the features.

ANSWER :

We have already used the stratified k-fold technique and found the results below.

Here we have tried out different training and testing datasets using stratified k-fold.

MODEL                  ACCURACY   F1         PRECISION   RECALL
SVC                    0.985839   0.986116   0.972641    1.000000
LGB                    0.980471   0.980428   0.985463    0.975714
GradientBoost          0.964353   0.964863   0.956153    0.973758
RandomForest           0.950193   0.949738   0.963232    0.936851
xgboost                0.947269   0.947777   0.943404    0.952399
Logistic               0.881832   0.889449   0.839742    0.945546
KNeighborsClassifier   0.610844   0.720852   0.563549    1.000000
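The stratified k-fold procedure above can be sketched as below: each fold preserves the pass/fail ratio, yielding several different train/test populations so the spread of scores shows how stable the accuracies are (data is synthetic).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("classification", SVC(kernel="rbf"))])

# Each fold keeps the class ratio; shuffling gives different populations
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=skf, scoring="f1")
print(np.round(scores, 3))
print(round(scores.std(), 3))   # a small spread means the scores are stable
```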

We also designed and tried out tuning with different parameters over a range of training and testing data through random search and grid search CV.

We have also tried feature engineering and dimensionality reduction.


CONCLUSION ON MODEL SELECTION :

• Display and compare all the models designed with their train and test accuracies.

• Select the final best trained model along with your detailed comments for selecting this model.

Reference:

Precision: When it predicts the positive result, how often is it correct? i.e. limit the number of false positives.

Recall: When it is actually the positive result, how often does it predict correctly? i.e. limit the number of false negatives.

Precision quantifies the number of positive class predictions that actually belong to the positive class. Recall quantifies the number of positive class predictions made out of all positive examples in the dataset.

Comparing All Models:

MODEL                  ACCURACY   F1         PRECISION   RECALL
SVC                    0.987792   0.987989   0.977201    0.999029
LGB                    0.979974   0.980092   0.980860    0.979583
GradientBoost          0.964846   0.965202   0.959684    0.970860
RandomForest           0.953607   0.953359   0.964430    0.942652
xgboost                0.948732   0.949334   0.942626    0.956278
Logistic               0.877438   0.885142   0.836867    0.939749
KNeighborsClassifier   0.618148   0.724786   0.568462    1.000000

We have tried out the steps below; the conclusions follow:

Model training

We trained the above list of models and found that SVM and LightGBM give the best results in terms of precision, recall, accuracy, and F1 score, as seen in the table above.

Cross Validation Techniques.

We used different cross validation techniques like:

- GRID SEARCH CV
- RANDOM SEARCH CV

We have averaged the results and published them in the table above.

Here also, SVC and LGB have given the best results compared to the other models.

HYPERPARAMETER TUNING

We used hyperparameter tuning to find the best parameters:

SVC... Best parameters for SVC: {'classification__kernel': 'rbf', 'classification__gamma': 0.01, 'classification__C': 1000}. Best F1 for SVC: 0.9988609281489563


LGB Classifier... Best parameters for the LGB classifier: {'classification__subsample_freq': 20, 'classification__subsample': 0.7, 'classification__reg_lambda': 1.3, 'classification__reg_alpha': 1.1, 'classification__num_leaves': 50, 'classification__n_estimators': 700, 'classification__min_split_gain': 0.4, 'classification__max_depth': 15, 'classification__colsample_bytree': 0.7}. Best F1 for the LGB classifier: 0.9703718701491529


DIMENSIONALITY REDUCTION

After reducing the dimensionality from 296 to 25 features, we get an F1 score of 0.9931661451766062 for SVC, which is good enough.

We were able to reduce the dimensionality significantly here.

ATTRIBUTE REMOVAL

We removed highly correlated features with a correlation greater than 0.7.

We removed the columns with more than 25% missing data

Standardisation/Normalisation

We standardised the data and scaled it.

Target Balancing

We performed target balancing with SMOTE and balanced the data. We then checked the statistical characteristics of the training, testing, and original data.
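The balancing idea can be sketched without imblearn by random oversampling of the minority (fail) class; note the report's actual method is SMOTE, which interpolates synthetic minority samples instead of duplicating existing ones, but the resulting class counts are balanced the same way.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([-1] * 90 + [1] * 10)   # imbalanced pass/fail target

# Oversample the minority (fail) class up to the majority count.
# SMOTE (imblearn) would instead create synthetic interpolated samples.
X_min, X_maj = X[y == 1], X[y == -1]
X_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
X_bal = np.vstack([X_maj, X_up])
y_bal = np.array([-1] * len(X_maj) + [1] * len(X_up))

print((y_bal == -1).sum(), (y_bal == 1).sum())   # 90 90 -> balanced
```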

Design a method of your own to check whether the achieved train and test accuracies might change if a different sample population is used for the train and test sets.

We have used the stratified k-fold technique and found the results.

Here we have tried out different training and testing datasets using stratified k-fold.

We also designed and tried out tuning with different parameters over a range of training and testing data through random search and grid search CV.


CONCLUSION

SVM has shown the best results among the models, with an F1 of 0.9988609281489563.

Other SVM results:

With SVM, the accuracy on the train data is: 1.0

With SVM, the accuracy on the test data is: 0.9931662870159453

Classification report:

              precision    recall  f1-score   support

          -1       1.00      0.99      0.99       444
           1       0.99      1.00      0.99       434

    accuracy                           0.99       878
   macro avg       0.99      0.99      0.99       878
weighted avg       0.99      0.99      0.99       878

• Pickle the selected model for future use.
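Pickling the selected model is a short round-trip; the file name here is illustrative and a tiny logistic regression stands in for the chosen SVC.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Persist the trained model to disk, then load it back for reuse
path = os.path.join(tempfile.mkdtemp(), "best_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

print((restored.predict(X) == model.predict(X)).all())   # True
```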

• Import the future data file. Use the same to perform the prediction using the best chosen model from above. Display the prediction results.

Here, by using the predict_result function and passing the saved model file and the data file, we get the predicted results.
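A minimal sketch of what a helper like predict_result could look like; only the function name comes from the text, so the signature and body are assumptions (the real version may also apply the same scaling/PCA used during training), and the demo model and files are synthetic.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def predict_result(model_path, data_path):
    """Load a pickled model and a CSV of features, return predictions.

    Hypothetical sketch of the helper named in the text; the notebook's
    actual version may differ.
    """
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    data = pd.read_csv(data_path)
    return model.predict(data)

# Demo with a tiny model and data file
tmp = tempfile.mkdtemp()
model_path = os.path.join(tmp, "best_model.pkl")
data_path = os.path.join(tmp, "future_data.csv")

X = pd.DataFrame({"f0": [0.0, 1.0, 2.0, 3.0]})
y = np.array([0, 0, 1, 1])
with open(model_path, "wb") as f:
    pickle.dump(LogisticRegression().fit(X, y), f)
X.to_csv(data_path, index=False)

preds = predict_result(model_path, data_path)
print(len(preds))   # 4
```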


6. Conclusion and improvisation:

• Write your conclusion on the results.

SVM has shown the best results among the models, with an F1 of 0.9988609281489563.

Other SVM results:

With SVM, the accuracy on the train data is: 1.0

With SVM, the accuracy on the test data is: 0.9931662870159453

Classification report:

              precision    recall  f1-score   support

          -1       1.00      0.99      0.99       444
           1       0.99      1.00      0.99       434

    accuracy                           0.99       878
   macro avg       0.99      0.99      0.99       878
weighted avg       0.99      0.99      0.99       878

We can conclude from all the findings that SVM is the best model as per our analysis, and it has provided good classification metrics and results.

Improvisation

We have a lot of missing data in the file.

We have many highly correlated features.

As we can see in the data profiling, there are many warnings for missing data, high skewness, zeros, and others.

The columns/signals come with no descriptions or functional details.

Many columns/features have a large proportion of missing data (>75%).

The data is hugely imbalanced: we have a very high number of pass data points and very few fail data points.

The data is highly skewed.

===========================================================================================

END